|
Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes or categories represented). They are opposite and roughly equivalent techniques: both apply a deliberate bias so that more samples are selected from one class than from another.

The usual reason for oversampling is to correct for a bias in the original dataset. One scenario where it is useful is training a classifier on labelled data from a biased source, since labelled training data is valuable but often comes from unrepresentative sources.

For example, suppose we have a sample of 1000 people of which 66.7% are male (perhaps the sample was collected at a football match), i.e. about 667 men and 333 women. We know the general population is 50% female, and we may wish to adjust our dataset to reflect this. Simple ''oversampling'' selects each female example twice, and this copying produces a balanced dataset of 1333 samples (667 male, 666 female) that is approximately 50% female. Simple ''undersampling'' drops some of the male samples at random to give a balanced dataset of 666 samples (333 male, 333 female), again 50% female.

There are also more complex oversampling techniques, including the creation of artificial data points.

== See also ==
* Sampling (statistics)
* Oversampling in signal processing, which is unrelated

Source: Wikipedia, the free encyclopedia.
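The arithmetic of the 1000-person example can be checked with a minimal Python sketch. It assumes that 66.7% of 1000 rounds to 667 male and 333 female records; the helper names (`make_sample`, `oversample_minority`, `undersample_majority`, `female_share`) are hypothetical and chosen only for illustration, and records are reduced to bare labels.

```python
import random

def make_sample(n_male=667, n_female=333):
    # Assumed split: 66.7% of 1000 people rounded to whole records.
    return ["male"] * n_male + ["female"] * n_female

def oversample_minority(data, minority):
    # Simple oversampling: every minority example appears twice.
    return data + [x for x in data if x == minority]

def undersample_majority(data, majority, seed=0):
    # Simple undersampling: keep a random subset of the majority class
    # that is the same size as the rest of the data.
    rng = random.Random(seed)
    majority_rows = [x for x in data if x == majority]
    minority_rows = [x for x in data if x != majority]
    kept = rng.sample(majority_rows, len(minority_rows))
    return minority_rows + kept

def female_share(data):
    return data.count("female") / len(data)

sample = make_sample()
over = oversample_minority(sample, "female")
under = undersample_majority(sample, "male")

print(len(over), round(female_share(over), 3))    # 1333 0.5
print(len(under), round(female_share(under), 3))  # 666 0.5
```

With these assumed counts, oversampling gives 667 + 666 = 1333 records and undersampling gives 333 + 333 = 666 records, so both results are balanced to within one record.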
|